Plug file descriptor leaks #643

Merged
merged 12 commits into from
Mar 13, 2023

Conversation

ssgier
Contributor

@ssgier ssgier commented Feb 23, 2023

Issue Number: #642

Objective of pull request: fix file descriptor leaks.

Pull request checklist

Your PR fulfills the following requirements:

  • Issue created that explains the change and why it's needed
  • Tests are part of the PR (for bug fixes / features)
  • Docs reviewed and added / updated if needed (for bug fixes / features)
  • PR conforms to Coding Conventions
  • PR applies BSD 3-clause or LGPL2.1+ Licenses to all code files
  • Lint (flakeheaven lint src/lava tests/) and (bandit -r src/lava/.) pass locally
  • Build tests (pytest) pass locally

Pull request type

Please check your PR type:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation changes
  • Other (please describe):

What is the current behavior?

  • File descriptors leak in multiple places. A high ulimit must be set for a successful pytest run.

What is the new behavior?

  • File descriptors no longer leak. Raising the ulimit is no longer necessary for a successful pytest run.

Does this introduce a breaking change?

  • Yes
  • No

Supplemental information

Key points all involve addressing sub-optimal usage of the Python multiprocessing library:

  • ensuring close is called on shared memory file handles (not the same as unlink, which is invoked by the shutdown of the manager)
  • closing pipes
  • closing resource handles to child processes

Note: the fix works for Linux. For other OSes it would have to be checked as well.
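
For readers less familiar with these multiprocessing details, the three key points can be sketched roughly as follows. This is a minimal illustration, not the code changed in this PR: the names `_child` and `run_once` are hypothetical, and the explicit "fork" start method is an assumption that only holds on Unix-like systems.

```python
import multiprocessing as mp
from multiprocessing import shared_memory


def _child(conn, shm_name: str) -> None:
    """Hypothetical child: attach to the block, send back its first bytes."""
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        conn.send(bytes(shm.buf[:5]))
    finally:
        shm.close()   # releases this process's handle; does not destroy the block
        conn.close()  # close the child's end of the pipe


def run_once() -> bytes:
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    shm = shared_memory.SharedMemory(create=True, size=16)
    try:
        shm.buf[:5] = b"hello"
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=_child, args=(child_conn, shm.name))
        proc.start()
        child_conn.close()  # the parent does not need the child's end
        try:
            return parent_conn.recv()
        finally:
            parent_conn.close()  # close the parent's end of the pipe
            proc.join()
            proc.close()         # release the handle to the child process
    finally:
        shm.close()   # close the parent's file handle ...
        shm.unlink()  # ... and, separately, destroy the block itself
```

Calling `run_once()` repeatedly should leave the number of open descriptors unchanged, because every handle opened along the way is closed explicitly rather than left for garbage collection.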

@mgkwill
Contributor

mgkwill commented Feb 24, 2023

Thanks for this PR/bug fix. I'll take a look tomorrow and give more feedback. In the meantime, I've added a few reviewers.

Contributor

@mathisrichter mathisrichter left a comment

Thanks so much for taking this on, @ssgier! This has been a problem from the start. If this indeed solves the problem, it would be an amazing contribution to Lava that makes a lot of lives easier!

I made a few renaming and style suggestions, nothing major.

Do you see a way of writing a unit test for this new functionality? One that ideally fails with the old implementation and passes with the new?

@ssgier
Contributor Author

ssgier commented Feb 24, 2023

Thanks for reviewing @mathisrichter!

I will implement your suggestions during the coming days and will also think of a way to write some specific tests.

Some follow-up information

Below is a screenshot of my terminal emulator, showing a proof of concept:
[Screenshot: Screenshot_2023-02-24_19-52-56]
The screenshot shows the following steps:

  • define an alias to run test_learning_rule and filter for relevant output
  • set a low ulimit for open file descriptors
  • run test with main branch -> fails with "Too many open files"
  • run test with fix branch -> passes
  • set higher ulimit for open file descriptors
  • run test with main branch -> passes now

This means that with a low ulimit, only the fix branch avoids the "Too many open files" error, whereas with a high ulimit both branches avoid it. The result is deterministic: repeated runs give the same outcome.
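
The effect of a low descriptor limit can also be reproduced from within Python. The following is a minimal sketch using the standard `resource` module (Unix only); the function name `open_until_limit` and the limit of 64 are arbitrary choices for illustration and are not part of the test setup described above.

```python
import os
import resource


def open_until_limit(soft_limit: int = 64) -> int:
    """Lower the soft fd limit, leak descriptors on purpose, return the errno."""
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Equivalent in spirit to `ulimit -n 64`, but only for this process.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    leaked = []
    try:
        while True:
            # Each call consumes one descriptor that is never closed in the loop.
            leaked.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return exc.errno
    finally:
        # Clean up and restore the original limit.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
```

The returned value should be `errno.EMFILE`, which is the error code behind the "Too many open files" message.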

I am running:

  • Python 3.10.9
  • Linux kernel 6.1.12

However: I tested the fix today on my MacBook (macOS Ventura 13.2) and there it does not work. After going back to the Linux machine and taking a deeper look, I noticed that if I call the tests in a specific way (by repeating the same test a large number of times), some descriptors still leak, although much more slowly, eventually leading to the same error.

Conclusion: This change fixes some leaks, but not all of them yet. At least one remains. I will try to track them all down and come back with an update soon.

@mathisrichter
Contributor

@ssgier Thanks for that deeper analysis! I brought up your PR internally and learned that there may be tests for this error in some branch. I asked @joyeshmishra to add them as a comment to this thread.
But I don't know the details of what those tests cover, so in the meantime, please continue with your own investigation.

@ssgier
Contributor Author

ssgier commented Feb 26, 2023

Update: Found some more leaks and committed a draft of a fix here.

New behavior: With these fixes applied, all file descriptor leaks appear to be resolved on both Linux and macOS, and the whole test suite runs through with a low ulimit.

Next steps: I will implement @mathisrichter's suggestions and also write an explicit test. It turns out that there is a simple way of doing this:

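import psutil

# Count the file descriptors held by the current process before the run.
p = psutil.Process()
num_fds_before = p.num_fds()

# <run subject under test>

# No descriptor should remain open once the subject under test has finished.
num_fds_after = p.num_fds()
assert num_fds_after == num_fds_before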
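
The same before/after check can be wrapped in a reusable helper. The sketch below is a standard-library variant that counts entries in `/proc/self/fd` (Linux) or `/dev/fd` (macOS) instead of using psutil; the names `count_open_fds` and `assert_no_fd_leak` are hypothetical and not necessarily what the unit test in this PR uses.

```python
import contextlib
import os


def count_open_fds() -> int:
    """Count this process's open descriptors via the fd directory."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            # The listing itself briefly opens one descriptor, but it does so
            # on every call, so before/after comparisons stay consistent.
            return len(os.listdir(fd_dir))
    raise RuntimeError("no fd directory found; platform not supported")


@contextlib.contextmanager
def assert_no_fd_leak():
    """Raise AssertionError if the wrapped block leaves descriptors open."""
    before = count_open_fds()
    yield
    after = count_open_fds()
    if after != before:
        raise AssertionError(f"fd count changed from {before} to {after}")
```

A test would then wrap the subject under test in `with assert_no_fd_leak(): ...`.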

Details

  1. The shared memory manager instance opens pipes internally, which are only closed when the Python object is reclaimed.
  2. Some of the ports are never joined by the runtime.
  3. This one is specific to macOS: the multiprocessing semaphore opens a special file descriptor: the POSIX Semaphore (type PSXSEM when running lsof). This file descriptor is only closed when the Python object is reclaimed.
  4. Threads were leaking in pypychannel. They waited on the semaphore and never got to check the self._done condition. This is not a file descriptor leak, but it is related to fixing item 3, because the threads hold a reference to the semaphore object. The cleanest solution would be to also join the threads, but this slows down the stopping process of the runtime.

I tried to keep the fixes as non-intrusive as possible. One shortcoming is that some of them rely on objects being reclaimed, but it works in practice and from what I can see, it does not break existing functionality.
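
To illustrate item 4, here is one common way to let a thread that blocks on a semaphore notice a done flag: release the semaphore once during shutdown so the thread wakes up and re-checks the flag. This is only a sketch of the general pattern, not the change made in pypychannel. The `Receiver` class is hypothetical, and `threading.Semaphore` stands in for the multiprocessing semaphore.

```python
import threading


class Receiver:
    """Hypothetical stand-in for a channel port with a background thread."""

    def __init__(self) -> None:
        self._sem = threading.Semaphore(0)
        self._done = False
        self.received = 0
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def notify(self) -> None:
        """Signal that one item is available."""
        self._sem.release()

    def _loop(self) -> None:
        while True:
            self._sem.acquire()  # blocks until notify() or stop()
            if self._done:       # re-check the flag after every wake-up
                return
            self.received += 1

    def stop(self, timeout: float = 1.0) -> bool:
        """Request shutdown; return True if the thread has terminated."""
        self._done = True
        self._sem.release()  # wake the thread so it can observe _done
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

Without the extra `release()` in `stop()`, the thread would stay blocked in `acquire()` indefinitely and keep its reference to the semaphore alive. The `join` with a timeout reflects the trade-off mentioned in item 4: waiting for threads is cleaner but slows down stopping.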

Contributor

@joyeshmishra joyeshmishra left a comment

Thanks for this contribution. Looks great. I expect you have verified on all three platforms before merge. One small comment on the code; otherwise it looks great.

@ssgier
Contributor Author

ssgier commented Mar 8, 2023

@joyeshmishra thanks for reviewing! I did verification on Linux and macOS but not on Windows. Is resource leakage also a problem on Windows? It looked like a problem specific to Unix-like systems.

@ssgier
Contributor Author

ssgier commented Mar 8, 2023

As discussed separately with @mathisrichter, I removed the now-obsolete hint from the README and tutorial. The tag referred to in the README will have to be updated in the next release.

Contributor

@harryliu-intel harryliu-intel left a comment

Looks good to me, thanks.

@mathisrichter mathisrichter merged commit 84e5ea4 into lava-nc:main Mar 13, 2023
monkin77 pushed a commit to monkin77/thesis-lava that referenced this pull request Jul 12, 2024
* Plug file descriptor leaks

* Plug more leaks

* Improve code style

* Check higher wait time on CI run

* Better fix for semaphore logic

* Add unit test for file descriptor leakage

* Let threads terminate asynchronously

* Disable leakage test on Windows (not applicable)

* Add more type hints

* Improve naming

* Remove hint about too many open files error

Successfully merging this pull request may close these issues.

File descriptors leaking with multiprocessing lib (error: "Too many open files")